Generalized Topic Modeling
نویسندگان
چکیده
Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution ai over words, and a document is generated by first selecting a mixture w over topics, and then generating words i.i.d. from the associated mixture Aw. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture. In this work we consider a broad generalization of this framework in which words are no longer assumed to be drawn i.i.d. and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a document classifier. That is, we aim to learn a predictor that given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this efficiently and discuss issues such as noise tolerance and sample complexity in this model. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning. ∗Supported in part by National Science Foundation grants CCF-1525971 and CCF-1535967. †Supported in part by National Science Foundation grant CCF-1525971 and by a Microsoft Research Graduate Fellowship and an IBM Ph.D Fellowship. ar X iv :1 61 1. 01 25 9v 1 [ cs .L G ] 4 N ov 2 01 6
منابع مشابه
Temporal expert finding through generalized time topic modeling
Please cite this article in press as: A. Daud et al., j.knosys.2010.04.008 This paper addresses the problem of semantics-based temporal expert finding, which means identifying a person with given expertise for different time periods. For example, many real world applications like reviewer matching for papers and finding hot topics in newswire articles need to consider time dynamics. Intuitively...
متن کاملShort and Sparse Text Topic Modeling via Self-Aggregation
The overwhelming amount of short text data on social media and elsewhere has posed great challenges to topic modeling due to the sparsity problem. Most existing attempts to alleviate this problem resort to heuristic strategies to aggregate short texts into pseudo-documents before the application of standard topic modeling. Although such strategies cannot be well generalized to more general genr...
متن کاملAutomatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
متن کاملConference Mining via Generalized Topic Modeling
Conference Mining has been an important problem discussed these days for the purpose of academic recommendation. Previous approaches mined conferences by using network connectivity or by using semantics-based intrinsic structure of the words present between documents (modeling from document level (DL)), while ignored semantics-based intrinsic structure of the words present between conferences. ...
متن کاملThe Generalized Metastable Switch Memristor Model
Memristor device modeling is currently a heavily researched topic and is becoming ever more important as memristor devices make their way into CMOS circuit designs, necessitating accurate and efficient memristor circuit simulations. In this paper, the Generalized Metastable Switch (MSS) memristor model is presented. The Generalized MSS model consists of a voltage-dependent stochastic component ...
متن کاملGeneralized multibaker maps exhibiting transient diffusion
Generalized multibaker maps are introduced to study properties of determin-istic diffusion. Emphasis is put on transient diffusion modeling systems which are spatially extended only in certain directions and escape of particles is allowed in other ones. Effects of nonlinearity are investigated by varying a control parameter.
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1611.01259 شماره
صفحات -
تاریخ انتشار 2016